System and method to convert lock-free algorithms to wait-free using a hardware accelerator

ABSTRACT

A method to convert a lock-free algorithm to a wait-free algorithm using a hardware accelerator includes (i) executing a plurality of software threads by a plurality of processing units, the plurality of software threads being associated with at least one operation, (ii) generating at least one of a read request or a write request at the hardware accelerator based on the execution, (iii) generating at least one operation, including a PARAM and the read request or the write request, at the hardware accelerator, (iv) checking an operation specific condition of at least one software thread of the plurality of software threads, and (v) updating at least one read value or write value and at least one state variable upon the operation specific condition being an operation success. The operation specific condition includes an operation success or an operation failure based on at least one of the PARAM, the read request, or the write request.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Indian patent application no. 348/CHE/2014 filed on Jan. 27, 2014, the complete disclosure of which, in its entirety, is herein incorporated by reference.

BACKGROUND

1. Technical Field

The embodiments herein generally relate to hardware accelerators and, more particularly, to a system and a method to convert lock-free algorithms to wait-free algorithms using hardware accelerators.

2. Description of the Related Art

Multithreading is the execution of multiple software threads simultaneously; it is indicative of the ability of a program or an operating system process to manage its use by more than one user at a time, and to manage multiple requests by the same user, without requiring multiple copies of the software program to run in the computer. Typically, central processing units (CPUs) have hardware support to efficiently execute multiple software threads simultaneously. However, CPUs enabled with multithreading capabilities are distinguished from multiprocessing systems (such as multi-core systems) in requiring sharing of one or more resources of a single core, including computing units, a CPU cache, and a translation lookaside buffer (TLB), to enable simultaneous execution of multiple software threads. Most multiprocessing systems use a variety of techniques to ensure the integrity of shared data, including, for example, locking mechanisms, software (SW) based lock-free algorithms, hardware (HW) assisted lock-free algorithms, transactional memory, and the like. The typical sequence for implementing a lock-free algorithm includes reading a value from a store, performing a set of operations with computation, and performing condition checks involving a read/write value (VALUE), a parameter, and/or state variables (STATE) in the store. If the operation succeeds, VALUE and STATE are updated in the store and VALUE is returned; otherwise the request fails.
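
As a concrete illustration, the following is a minimal C sketch of this lock-free sequence, assuming a hypothetical bounded counter whose limit serves as the operation specific condition; the operation and all names are illustrative only.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_VALUE 100   /* illustrative operation-specific limit */

    atomic_int value;       /* VALUE held in the shared store */

    /* Lock-free add: read, compute, condition-check, then attempt an
     * atomic update.  The attempt fails, and must be retried, if another
     * thread updated VALUE between the load and the compare-and-swap. */
    bool try_add(int param)
    {
        int old = atomic_load(&value);          /* read VALUE from the store */
        int desired = old + param;              /* computation */
        if (desired > MAX_VALUE)                /* operation-specific condition */
            return false;                       /* operation failure */
        /* succeeds only if VALUE still equals old (atomicity check) */
        return atomic_compare_exchange_strong(&value, &old, desired);
    }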

Apart from the operation specific condition check failing, failure can also happen due to atomicity being violated, i.e., multiple threads trying to execute the above sequence with at least one of the steps overlapping in time. The atomicity violation problem leads to one attempt succeeding and all other concurrent attempts failing. When there is a failure due to atomicity violation, an application thread is expected to retry the operation (OPn), and hence the operation is not “wait-free”. Depending on the prioritization, design, and state of the system, multiple attempts may have to be made before an attempt succeeds. This approach makes the timing requirements of the system unpredictable, and therefore the approach may not be suitable for use in systems requiring deterministic behavior. Detection of atomicity violation is often performed using the value of a location. For example, if a thread reads ‘A’ as the value of the location and needs to update it to ‘N’, it may issue an atomic Compare and Swap (CAS) instruction, which updates the location to ‘N’ if it still holds ‘A’ but fails if the location contains any other value (due to another thread updating the value). However, checking that the location still holds ‘A’ does not mean it has not been updated: the location could have been changed from ‘A’ to, say, ‘B’ and then back to ‘A’ by one or more other threads. This scenario is termed the ‘ABA’ hazard and leads to incorrect results. Typical implementations of lock-free algorithms suffer from this.
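
The ABA hazard can be made concrete with the classic lock-free stack pop, sketched below in C (the structure and names are illustrative). If a thread is preempted after reading the top node and its next pointer, other threads may pop ‘A’, pop ‘B’, and push ‘A’ back; the preempted thread's CAS then succeeds even though the next pointer it read is stale.

    #include <stdatomic.h>
    #include <stddef.h>

    struct node { struct node *next; };

    _Atomic(struct node *) top;   /* head of the lock-free stack */

    /* Pop with a bare CAS: vulnerable to the ABA hazard.  If, between the
     * load of old and the CAS, other threads pop 'A', pop 'B', and push
     * 'A' back, the CAS still sees 'A' at top and succeeds, but next was
     * read from the earlier 'A' and may point to the freed node 'B'. */
    struct node *pop(void)
    {
        struct node *old, *next;
        do {
            old = atomic_load(&top);
            if (old == NULL)
                return NULL;        /* stack empty */
            next = old->next;       /* may be stale by the time the CAS runs */
        } while (!atomic_compare_exchange_weak(&top, &old, next));
        return old;
    }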

Eliminating hazards like the ABA problem further complicates implementation, requiring additional overhead with Compare and Swap (CAS) and an extremely conservative approach with LL/SC (Load Link/Store Conditional) in determining atomicity violations (mostly due to the higher cost of accurate determination), leading to atomicity failures even in cases where it would have been safe for the operation to succeed. Wait-free algorithms can be created for certain structures, but their performance is worse than that of lock-free or even lock-based approaches. In some cases they also require memory proportional to the number of application threads. Accordingly, there remains a need for an efficient system that reduces the atomicity problem and the ABA hazard and thereby ensures the integrity of shared data.

SUMMARY

In view of the foregoing, an embodiment herein provides a method of converting a lock-free algorithm to a wait-free algorithm with a hardware accelerator. The method includes (i) executing a plurality of software threads by a plurality of processing units, the plurality of software threads being associated with at least one operation, (ii) generating at least one of a read request or a write request at the hardware accelerator based on the execution, (iii) generating at least one operation, including a PARAM and the read request or the write request, at the hardware accelerator, (iv) checking an operation specific condition of at least one software thread of the plurality of software threads, and (v) updating at least one of a read value or a write value and at least one state variable upon the operation specific condition being an operation success. The plurality of processing units is communicatively associated with the hardware accelerator. The at least one operation is one of the read request or the write request. The hardware accelerator is associated with a plurality of buses. The hardware accelerator is accessible to the plurality of software threads associated with the plurality of processing units as a memory mapped device, mapped into a pre-determined physical address range of each of the plurality of buses, for ensuring contention resolution among the plurality of buses. The operation specific condition includes an operation success or an operation failure based on at least one of the PARAM, the read request, or the write request.

The method may further include performing, prior to checking the operation specific condition: (i) encoding the at least one operation and the device address associated with the read request to obtain encoded data, and (ii) returning at least one of a failure value or a success value of the at least one operation from the hardware accelerator to the plurality of software threads on a plurality of data lines associated with the pre-determined physical address range. The encoded data is communicated to the hardware accelerator by the plurality of software threads executed by the plurality of processing units. The lock-free algorithm is partitioned into the software and the hardware. The encoded data is passed from the software to the hardware, and return encoded data is obtained from the hardware.

The method may further include performing, prior to checking the operation specific condition, encoding the at least one operation, the PARAM, the device address, and the plurality of data lines associated with the write request to obtain encoded data. The encoded data is communicated to the hardware accelerator by the plurality of software threads executed by the plurality of processing units. The lock-free algorithm is partitioned into the software and the hardware. The encoded data is passed from the software to the hardware. In one embodiment, a contention within each of the plurality of buses is resolved through one of an arbitration protocol and a starvation-free priority resolution technique.

The at least one operation and the PARAM may be encoded as the least significant bits of the encoded data. The steps of checking the operation specific condition and updating may be performed by the hardware accelerator. The steps of encoding and returning may be performed by the hardware accelerator. The pre-determined physical address range associated with each of the plurality of buses may be associated with at least one processing unit of the plurality of processing units. The method may further include encoding the at least one operation, the device address, and a memory address location of the PARAM for generating the encoded data, upon the size of the PARAM exceeding a pre-allocated number of bits for the PARAM in the encoded data.

The memory address location may correspond to a pre-allocated memory for the PARAM. The pre-allocated memory may be allocated proportional to a number of concurrent requests during execution of the plurality of software threads by the hardware accelerator at any predetermined instance of time. The method may further include at least one of (a) masking at least one interrupt on a processing unit, from among the plurality of processing units, being accessed by the hardware accelerator, (b) writing into the pre-allocated memory for the PARAM reserved for the processing unit, (c) performing a read or write operation to the hardware accelerator and passing the pre-allocated memory as PARAM for the encoding, and (d) unmasking the masked interrupt. The method may further include allocating the pre-allocated memory for the PARAM based on a circular queue, which includes at least one of (i) reading a dedicated hardware accelerator to obtain a pre-allocated memory for the PARAM, (ii) writing into the pre-allocated memory for the PARAM reserved for the processing unit, (iii) performing a read or write operation to the dedicated hardware accelerator and passing the pre-allocated memory as PARAM, and (iv) writing the pre-allocated memory into the dedicated hardware accelerator to release the pre-allocated memory. The dedicated hardware accelerator may be dedicated for PARAM memory allocation.

In one aspect, a hardware accelerator including a dedicated digital logical circuit and memory storing at least one VALUE and at least one STATE is provided. The dedicated digital logical circuit is configured to (i) process at least one of a read request or a write request at the hardware accelerator upon execution of a plurality of software threads by a plurality of processing units, the plurality of software threads being associated with at least one operation, (ii) process at least one operation including a PARAM and the read request or the write request at the hardware accelerator, (iii) check an operation specific condition of at least one software thread of the plurality of software threads, and (iv) update at least one of: at least one read VALUE or write VALUE and at least one STATE variable upon the operation specific condition being an operation success. The operation specific condition includes an operation success or an operation failure based on at least one of the PARAM, the read request, or the write request. The hardware accelerator is associated with a plurality of buses. The hardware accelerator is accessible to the plurality of software threads associated with the plurality of processing units as a memory mapped device, mapped into a pre-determined physical address range of each of the plurality of buses, for ensuring contention resolution among the plurality of buses.

The hardware accelerator may be further configured to perform, prior to checking the operation specific condition: (i) decoding the at least one operation and the device address associated with the read request from the encoded data, and (ii) returning at least one of a failure value or a success value of the at least one operation from the hardware accelerator to the plurality of software threads on a plurality of data lines associated with the pre-determined physical address range. The lock-free algorithm may be partitioned into the software and the hardware. The encoded data may be passed from the software to the hardware, and return encoded data is obtained from the hardware. The encoded data may be communicated to the hardware accelerator by the plurality of software threads executed by the plurality of processing units. The hardware accelerator may be further configured to perform, prior to checking the operation specific condition, decoding the at least one operation, the PARAM, the device address, and the plurality of data lines associated with the write request from the encoded data. The encoded data may be communicated to the hardware accelerator by the plurality of software threads executed by the plurality of processing units. The lock-free algorithm may be partitioned into the software and the hardware. The encoded data may be passed from the software to the hardware. A contention within each of the plurality of buses may be resolved through one of an arbitration protocol and a starvation-free priority resolution technique.

The at least one operation and the PARAM may be encoded as the least significant bits of the encoded data. The pre-determined physical address range associated with each of the plurality of buses may be associated with at least one processing unit of the plurality of processing units. The hardware accelerator may be further configured to decode the at least one operation, the device address, and a memory address location of the PARAM from the encoded data, upon the size of the PARAM exceeding a pre-allocated number of bits for the PARAM in the encoded data. The memory address location corresponds to a pre-allocated memory for the PARAM. The hardware accelerator may be further configured to, upon receiving a read or write operation that passes the pre-allocated memory as PARAM for the encoding, perform a read operation to retrieve the pre-allocated memory and use its contents as PARAM for the requested operation. The hardware accelerator may be further configured to allocate the pre-allocated memory for the PARAM based on a circular queue, which includes (i) reading a dedicated hardware accelerator to allocate a pre-allocated memory for the PARAM, and (ii) writing the pre-allocated memory into the dedicated hardware accelerator to release the pre-allocated memory. The dedicated hardware accelerator may be dedicated for PARAM memory allocation.

In another aspect, a hardware accelerator including a processor and a memory storing instructions executed by the processor is provided. The memory stores at least one VALUE and at least one STATE. The processor is configured to (i) process at least one of a read request or a write request at the hardware accelerator upon execution of a plurality of software threads by a plurality of processing units, the plurality of software threads being associated with at least one operation, (ii) process at least one operation including a PARAM and the read request or the write request at the hardware accelerator, (iii) check an operation specific condition of at least one software thread of the plurality of software threads, and (iv) update at least one of: at least one read VALUE or write VALUE and at least one STATE variable upon the operation specific condition being an operation success. The hardware accelerator is associated with a plurality of buses. The hardware accelerator is accessible to the plurality of software threads associated with the plurality of processing units as a memory mapped device, mapped into a pre-determined physical address range of each of the plurality of buses, for ensuring contention resolution among the plurality of buses. The operation specific condition includes an operation success or an operation failure based on at least one of the PARAM, the read request, or the write request.

These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments herein will be better understood from the following detailed description with reference to the drawings, in which:

FIG. 1 is a system view illustrating a plurality of software threads (SW) within a plurality of processing units interfacing with a hardware accelerator to convert lock-free algorithms to wait-free algorithms according to an embodiment herein;

FIG. 2A is an exemplary view of address mapping of the hardware accelerator of FIG. 1 according to an embodiment herein;

FIG. 2B is an exemplary view of encoding at least one operation (OPn) and PARAM in the plurality of software threads (SW) and communicating them to the hardware accelerator of FIG. 1 according to an embodiment herein;

FIG. 3A is a flow diagram illustrating a method of allocating a PARAM_MEMORY within the plurality of processing units of FIG. 1 according to an embodiment herein;

FIG. 3B is a flow diagram illustrating a method of allocating a PARAM_MEMORY using a circular queue according to an embodiment herein;

FIG. 4 is a flow diagram illustrating a method of converting a lock-free algorithm to a wait-free algorithm with the hardware accelerator according to an embodiment herein;

FIG. 5 illustrates a schematic diagram of computer architecture according to an embodiment herein; and

FIG. 6 is a flow diagram illustrating wait-free algorithm operations with the hardware accelerator according to an embodiment herein.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.

Various embodiments of the methods and systems disclosed herein provide an efficient technique to reduce the problems of atomicity violations and the ABA hazard, so as to ensure the integrity of shared data, by providing specific partitioning and interfacing between software and hardware designed to eliminate atomicity violations. In an embodiment, a method for converting lock-free algorithms to wait-free algorithms with a hardware accelerator is provided. The hardware accelerator stores read/write values (VALUE) and state variables (STATE) to perform a set of operations (OPn) required for a lock-free algorithm. In addition, the hardware accelerator performs one or more computations and condition checks, and updates VALUE and STATE. In an embodiment, the hardware accelerator is accessible to the software (SW) as a memory mapped device, mapped into a pre-determined physical address range of each bus. Referring now to the drawings, and more particularly to FIGS. 1 through 6, where similar reference characters denote corresponding features consistently throughout the figures, preferred embodiments are shown.

FIG. 1 is a system view illustrating a plurality of software threads (SW) 102A-N within a plurality of processing units 104A-N interfacing with a hardware accelerator 108 to convert lock-free algorithms to wait-free algorithms according to an embodiment herein. The system view 100 includes the plurality of software threads (SW) 102A-N, the plurality of processing units 104A-N, and a plurality of buses 106A-N associated with the hardware accelerator 108. In one embodiment, the plurality of software threads (SW) 102A-N is executed by the plurality of processing units 104A-N (e.g., CPU set A 104A, CPU set B 104B, CPU set N 104N; as used herein, the term “CPU set” is construed as referring to a plurality of processing units). The plurality of software threads (SW) 102A-N is associated with one or more operations. In one embodiment, the plurality of processing units 104A-N is communicatively associated with the hardware accelerator 108. In one embodiment, the one or more operations include, for example, a read request or a write request.

The hardware accelerator 108 stores VALUE (e.g., read/write values) and/or STATE (e.g., state variables) and performs a set of operations (OPn) required for a lock-free algorithm, including (i) computations, (ii) condition checks, and (iii) updating of VALUE and STATE. The hardware accelerator 108 includes a dedicated digital logical circuit and memory storing at least one VALUE and at least one STATE. The hardware accelerator 108 is accessible to the one or more software threads (SW) 102A-N as a memory mapped device. The hardware accelerator 108 is mapped into a pre-determined physical address range of each of the one or more buses for ensuring contention resolution among the plurality of buses 106A-N. In one embodiment, each bus is associated with one or more CPUs, and one or more software threads executing on any of the CPUs is able to interact with the hardware accelerator 108 by issuing a read or write request. In one embodiment, the set of operations is, for example, OP1 (VALUE, PARAM, STATE), OP2 (VALUE, PARAM, STATE), . . . , OPn (VALUE, PARAM, STATE). One or more read requests or write requests are generated at the hardware accelerator 108 based on the execution. The one or more operations, including PARAM and the read request or the write request, are generated at the hardware accelerator 108. In an embodiment, the hardware accelerator 108 checks an operation specific condition of at least one software thread of the plurality of software threads (SW) 102A-N. In one embodiment, the operation specific condition includes an operation success or an operation failure based on at least one of the PARAM, the read request, or the write request. The hardware accelerator 108 updates at least one read value, write value, and/or at least one state variable upon the operation specific condition being an operation success.
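
The following is a minimal C model of the per-request behavior that the dedicated digital logical circuit implements; the operation shown (adding PARAM to VALUE within a capacity kept in STATE) and all names are assumptions for illustration. Because the accelerator drains requests one at a time, the check-then-update below is inherently atomic and no retry path exists.

    #include <stdbool.h>

    /* Software model of the accelerator's per-request processing; the
     * operation (add PARAM to VALUE within the capacity kept in STATE)
     * is illustrative.  Requests are drained one at a time, so a request
     * either succeeds or fails for an operation-specific reason, never
     * because another thread interleaved. */
    struct accel {
        int value;   /* VALUE */
        int state;   /* STATE: here, the remaining capacity */
    };

    struct op_result {
        bool success;
        int  value;
    };

    struct op_result process_op1(struct accel *a, int param)
    {
        struct op_result r = { false, 0 };
        if (param <= a->state) {      /* operation-specific condition check */
            a->value += param;        /* update VALUE */
            a->state -= param;        /* update STATE */
            r.success = true;
            r.value   = a->value;
        }
        return r;                     /* success or failure; no retry path */
    }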

FIG. 2A is an exemplary view 200A of an address mapping of the hardware accelerator 108 of FIG. 1 according to an embodiment herein. The exemplary view 200A includes a plurality of addresses 202A-N associated with the plurality of processing units 104A-N. In one embodiment, on each bus, the hardware accelerator 108 is addressable through a unique address range (unique to each bus). For example, on a bus with 32 address bits, the hardware accelerator 108 is mapped to the address range 0x4400 0000 to 0x44FF FFFF. Any read/write access request by a software thread (SW) running on the CPU set A 104A (which is connected to BUS A) to an address between 0x4400 0000 and 0x44FF FFFF (both inclusive) is routed to the hardware accelerator 108. The hardware accelerator 108 processes the request and responds as per the specific BUS protocol.
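
For the 32-bit example above, the bus fabric's routing decision can be expressed as the following check (a sketch; the constants mirror the example range, and the helper name is illustrative).

    #include <stdbool.h>
    #include <stdint.h>

    #define ACCEL_BASE 0x44000000u   /* start of the mapped range on BUS A */
    #define ACCEL_END  0x44FFFFFFu   /* end of the mapped range (inclusive) */

    /* True if a read/write on BUS A must be routed to the accelerator. */
    static inline bool routes_to_accelerator(uint32_t addr)
    {
        return addr >= ACCEL_BASE && addr <= ACCEL_END;
    }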

FIG. 2B is an exemplary view 200B of encoding at least one operation (OPn) and PARAM in the plurality of software threads (SW) 102A-N and communicating them to the hardware accelerator 108 of FIG. 1 according to an embodiment herein. The exemplary view 200B includes a device address MSB 204A, at least one operation OPn 204B (VALUE, PARAM, and STATE), and a PARAM 204C. In one embodiment, the OPn and PARAM are encoded in the plurality of software threads (SW) 102A-N and communicated to the hardware accelerator 108 for partitioning a lock-free algorithm into software (SW) and hardware (HW). In one embodiment, an address range for the hardware accelerator 108 is selected to be large enough to encode and pass additional information (e.g., OPn and PARAM in the LSBs). For example, in the address mapping above, the 8 MSBs are used for addressing, and the remaining 24 bits are used as 4 bits for the OPn 204B and 20 bits for the PARAM 204C. If the operation OPn 204B returns a result, it is mapped as a read request to the hardware accelerator 108 and the result is returned on the data lines; otherwise it is mapped as a write request. In case of a write, the PARAM 204C may also be passed on the data lines in addition to the encoded field in the address. A specific FAILURE VALUE may indicate failure, or a specific bit in the result may encode the success or failure status.
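
The 8/4/20-bit split of this example can be sketched in C as follows; the field layout matches the example above (8 MSBs select the device, 4 bits carry the OPn, 20 bits carry the PARAM), while the helper names and the device selector value are illustrative.

    #include <stdint.h>

    #define ACCEL_DEVICE 0x44u   /* 8-bit device selector (the MSBs) */

    /* Pack the device selector, a 4-bit OPn, and a 20-bit PARAM into a
     * 32-bit bus address, per the example layout of FIG. 2B. */
    static inline uint32_t encode_request(uint8_t opn, uint32_t param)
    {
        return ((uint32_t)ACCEL_DEVICE << 24)
             | ((uint32_t)(opn & 0xFu) << 20)
             | (param & 0xFFFFFu);
    }

    /* Issue the operation as a read: the result comes back on the data
     * lines.  Only meaningful on the target hardware; the volatile access
     * keeps the compiler from eliding the bus transaction. */
    static inline uint32_t issue_read_op(uint8_t opn, uint32_t param)
    {
        volatile uint32_t *addr =
            (volatile uint32_t *)(uintptr_t)encode_request(opn, param);
        return *addr;   /* a FAILURE VALUE or a success-encoding result */
    }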

A specific BUS implementation may have an arbitration/scheduling policy to serialize accesses to the hardware accelerator 108. For example, suppose a software thread (SW) on CPU1 and a software thread (SW) on CPU2, both connected to BUS1, access the hardware accelerator 108. Then the accesses may be serialized as CPU1 first, CPU2 second, or CPU2 first, CPU1 second, as per the BUS implementation. The hardware accelerator 108 may independently receive a request on each BUS, and if more than one request is received at the same time, a contention arises. In one embodiment, the contention across the buses is resolved by the hardware accelerator 108. The contention resolution may be performed based on any starvation-free priority resolution method. For example, the contention resolution method may be round robin, where the plurality of buses 106A-N is serviced in a fixed repeating sequence, say A, B, . . . , N and again A, B, . . . , N and so on. Once a specific request is selected, the hardware accelerator 108 may perform the selected OPn 204B. The selected OPn 204B may return a result and update VALUE and STATE on success, or indicate failure. The hardware accelerator 108 may then move on to process the next request, in one example embodiment.
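
A software model of such a round-robin selection among pending bus requests might look like the following sketch (the names and the bus count are assumptions); resuming the scan just past the last bus granted guarantees every bus is serviced within one full rotation, which is what makes the scheme starvation-free.

    #include <stdbool.h>

    #define NUM_BUSES 4   /* illustrative bus count */

    static bool pending[NUM_BUSES];   /* one request flag per bus */
    static int  next_bus;             /* where the round-robin scan resumes */

    /* Pick the next bus to service.  Scanning from next_bus and wrapping
     * guarantees every pending bus is granted within NUM_BUSES decisions,
     * so no bus starves.  Returns -1 if nothing is pending. */
    static int arbitrate(void)
    {
        for (int i = 0; i < NUM_BUSES; i++) {
            int bus = (next_bus + i) % NUM_BUSES;
            if (pending[bus]) {
                next_bus = (bus + 1) % NUM_BUSES;
                return bus;
            }
        }
        return -1;
    }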

In one embodiment, a memory extension is performed by using an additional memory when encoding the PARAM into the bits available in the address is not possible. The additional memory required is proportional to the maximum number of concurrent requests that may be made on the hardware accelerator 108 at any time, which may be less than the number of software threads in a system.

For example, before making a request, the software thread first allocates PARAM_MEMORY_i, then writes the PARAM 204C into the memory reserved for PARAM_MEMORY_i (which may be arbitrarily large), and then just passes ‘i’ to the hardware accelerator 108. When the request is selected for processing, the hardware accelerator 108 first fetches the PARAM 204C from the location associated with PARAM_MEMORY_i and then processes the request. In one example embodiment, the PARAM_MEMORY allocation is designed to be “wait-free”, and hence the overall operation remains “wait-free” with the PARAM memory extension.

FIG. 3A is a flow diagram 300A illustrating a method of allocating a PARAM_MEMORY within one or more processing units, such as the plurality of processing units 104A-N of FIG. 1, according to an embodiment herein. In step 302A, one or more interrupts are masked on a current CPU (e.g., in the CPU set A 104A, also denoted as CPU ‘i’). In step 304A, a write operation is performed into PARAM_MEMORY_i (e.g., a pre-allocated memory) which is reserved for the CPU ‘i’. In step 306A, the required read/write operation is performed. In step 308A, ‘i’ is passed as the PARAM. In step 310A, the one or more masked interrupts are unmasked. In one embodiment, masking the interrupts for the short duration of a bounded write and read operation may not affect the performance of the system.
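
A C sketch of the FIG. 3A sequence follows; the interrupt-masking helpers, the CPU count, and the bus-access helper from the FIG. 2B sketch are all assumptions for illustration.

    #include <stdint.h>

    /* Hypothetical platform helpers: mask/unmask interrupts on the
     * current CPU and report the current CPU index. */
    extern unsigned long irq_mask_local(void);
    extern void irq_unmask_local(unsigned long flags);
    extern int current_cpu(void);

    /* Bus-access helper from the FIG. 2B sketch (assumed). */
    extern uint32_t issue_read_op(uint8_t opn, uint32_t param);

    #define NUM_CPUS 8            /* illustrative */
    #define PARAM_SLOT_WORDS 16   /* per-slot payload; may be made larger */
    static uint32_t param_memory[NUM_CPUS][PARAM_SLOT_WORDS];

    /* FIG. 3A sequence: with interrupts masked, the slot reserved for
     * CPU 'i' cannot be reused before the operation has been issued. */
    uint32_t op_with_large_param(uint8_t opn, const uint32_t *payload, int words)
    {
        unsigned long flags = irq_mask_local();            /* step 302A */
        int i = current_cpu();
        for (int w = 0; w < words; w++)
            param_memory[i][w] = payload[w];               /* step 304A */
        uint32_t result = issue_read_op(opn, (uint32_t)i); /* steps 306A-308A */
        irq_unmask_local(flags);                           /* step 310A */
        return result;
    }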

In one embodiment, if a system is already upper-bounded on the maximum number of concurrent requests that can be made at any time, and if the bound is less than the number of CPUs, then a dynamic PARAM_MEMORY allocation using a circular queue is implemented. The circular queue may itself be another “wait-free” circular queue implementation using another hardware accelerator (no PARAM is required for implementing the circular queue). Further, the one or more hardware accelerators may be connected to directly free the PARAM_MEMORY allocation after the PARAM is read, in one embodiment.

FIG. 3B is a flow diagram 300B illustrating a method of allocating a PARAM_MEMORY using a circular queue according to an embodiment herein. In step 302B, the PARAM_CQUEUE hardware accelerator (e.g., a dedicated hardware accelerator) is read to obtain a PARAM_MEMORY_i. In step 304B, the PARAM is written into PARAM_MEMORY_i. In step 306B, the required read/write operation is performed. In step 308B, ‘i’ is passed as the PARAM. The PARAM_MEMORY_i is released after the PARAM is read, by writing ‘i’ to the PARAM_CQUEUE hardware accelerator.
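
A corresponding C sketch of the FIG. 3B sequence follows, reusing the illustrative bus-access helpers; the OPn numbers assigned to the PARAM_CQUEUE accelerator and the software-driven release are assumptions consistent with the description above.

    #include <stdint.h>

    /* Hypothetical bus-access helpers from the FIG. 2B sketch. */
    extern uint32_t issue_read_op(uint8_t opn, uint32_t param);
    extern void     issue_write_op(uint8_t opn, uint32_t param, uint32_t data);

    /* Illustrative OPn numbers for the dedicated PARAM_CQUEUE accelerator. */
    #define OP_CQ_ALLOC 0u               /* read: dequeue a free slot index */
    #define OP_CQ_FREE  1u               /* write: enqueue (release) an index */

    #define PARAM_SLOT_WORDS 16
    extern uint32_t param_memory[][PARAM_SLOT_WORDS];  /* shared PARAM slots */

    /* FIG. 3B sequence: allocate a slot from the wait-free circular queue,
     * fill it, issue the real operation with the slot index as PARAM, and
     * release the slot once the accelerator has consumed it. */
    uint32_t op_with_queued_param(uint8_t opn, const uint32_t *payload, int words)
    {
        uint32_t i = issue_read_op(OP_CQ_ALLOC, 0);   /* step 302B */
        for (int w = 0; w < words; w++)
            param_memory[i][w] = payload[w];          /* step 304B */
        uint32_t result = issue_read_op(opn, i);      /* steps 306B-308B */
        issue_write_op(OP_CQ_FREE, 0, i);             /* release slot i */
        return result;
    }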

In one embodiment, a “wait-free” circular buffer can be implemented. The hardware accelerator 108 is initialized and the SIZE of the circular buffer is fixed. In one example embodiment, when a software thread wants to allocate space for writing, the software thread may use OP0=write_start with PARAM=length of the buffer to be allocated (a usage sketch of these operations follows this description). The operation is mapped as a read to the hardware accelerator 108, and the operation “Result” is returned as the read value. When the operation succeeds, “Result” is between 0 and SIZE-1, and the locations from Result to (Result+length) modulo SIZE may be written. When the operation fails (due to lack of space), “Result” is “SIZE”.

In another example embodiment, when a software thread finishes writing to the allocated space and wants to indicate write completion, it may use OP1=write_done with PARAM=(Result, Length) returned by the corresponding successful write_start. This operation is mapped as a write to the hardware accelerator 108 (there is no return value for this operation).

In yet another example embodiment, when a software thread wants to read from the circular buffer, it may use OP2=read_start with PARAM=length of the buffer required to read. The operation is mapped as a read to the hardware accelerator, and the operation “Result” is returned as the read value. When the operation succeeds, “Result” is between 0 and SIZE-1, and the locations from “Result” to (Result+length) modulo SIZE may be read. When the operation fails (not enough items to read), “Result” is “SIZE”. In yet another example embodiment, when a software thread finishes reading and wants to indicate read completion, it may use OP3=read_done with PARAM=(Result, Length) returned by the corresponding successful read_start. The operation is mapped as a write to the hardware accelerator (there is no return value for this operation).
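
Taken together, the four operations give a wait-free producer/consumer protocol; a producer-side usage sketch under the illustrative encoding follows (the OPn numbers match the text, while the helper names and the packing of (Result, Length) across the address and data lines are assumptions). A consumer mirrors this sequence with OP2=read_start and OP3=read_done.

    #include <stdint.h>

    #define OP_WRITE_START 0u   /* OP0 in the text */
    #define OP_WRITE_DONE  1u   /* OP1 in the text */
    #define BUF_SIZE 4096u      /* SIZE, fixed at initialization */

    /* Hypothetical bus-access helpers from the FIG. 2B sketch. */
    extern uint32_t issue_read_op(uint8_t opn, uint32_t param);
    extern void     issue_write_op(uint8_t opn, uint32_t param, uint32_t data);

    extern uint8_t circular_buffer[BUF_SIZE];   /* the shared data area */

    /* Producer: reserve space, copy the data in, then publish.  Each
     * accelerator access is a single bus transaction, so the sequence is
     * wait-free.  Packing (Result, Length) as (data, param) is an
     * assumption about the write_done encoding. */
    int produce(const uint8_t *data, uint32_t len)
    {
        uint32_t off = issue_read_op(OP_WRITE_START, len);
        if (off == BUF_SIZE)
            return -1;                             /* failure: lack of space */
        for (uint32_t k = 0; k < len; k++)
            circular_buffer[(off + k) % BUF_SIZE] = data[k];
        issue_write_op(OP_WRITE_DONE, len, off);   /* publish the write */
        return 0;
    }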

FIG. 4 is a flow diagram 400 illustrating a method of converting lock-free algorithms to wait-free algorithms with the hardware accelerator 108 according to an embodiment herein. In step 402, the plurality of software threads 102A-N associated with OPn (VALUE, PARAM, and STATE) is obtained as an input. In step 404, a check for the operation (OPn) specific condition (e.g., OPn success or OPn specific failure) is performed. In step 406, VALUE and STATE are updated if not already updated by another software thread, in which case the status is indicated as “OPn success”; otherwise the status is indicated as “OPn failure” and control returns to step 402. In one embodiment, the update of VALUE and STATE is performed at a central processing unit (CPU) with hardware (HW) assistance (e.g., CAS, LL/SC). In step 408, the operation status is indicated as “OPn success”. Step 410 enables the partitioning of hardware and software: an OPn, PARAM read/write is input to the hardware accelerator 108. In step 412, a check for the OPn specific condition (e.g., OPn success or OPn specific failure) is performed. In step 414, VALUE and STATE are updated and the status is indicated as “OPn success”; otherwise the status is indicated as “OPn failure”, and the value/status is returned to the software thread 102A-N. In one embodiment, a lock-free algorithm is mapped onto steps 402 to 408, and steps 410 to 414 indicate the lock-free algorithm converted to a wait-free algorithm with the hardware accelerator 108.
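
The contrast between the two halves of FIG. 4 can be sketched in C as follows; the bounded-add operation, the FAILURE VALUE encoding, and the helper from the FIG. 2B sketch are illustrative assumptions.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    extern uint32_t issue_read_op(uint8_t opn, uint32_t param); /* FIG. 2B sketch */
    #define FAILURE_VALUE 0xFFFFFFFFu  /* illustrative OPn-failure encoding */
    #define LIMIT 100u                 /* illustrative operation-specific bound */

    /* Steps 402-408 (software lock-free): the CAS may fail any number of
     * times when other threads win the race, so the loop is unbounded. */
    bool op_lock_free(atomic_uint *value, unsigned param)
    {
        for (;;) {
            unsigned old = atomic_load(value);
            if (old + param > LIMIT)                     /* step 404 */
                return false;                            /* OPn-specific failure */
            if (atomic_compare_exchange_weak(value, &old, old + param))
                return true;                             /* steps 406-408 */
            /* atomicity violated: retry from step 402 */
        }
    }

    /* Steps 410-414 (hardware wait-free): one mapped access, no retry path;
     * latency is bounded by the accelerator's round-robin arbitration. */
    bool op_wait_free(uint8_t opn, uint32_t param)
    {
        return issue_read_op(opn, param) != FAILURE_VALUE;
    }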

The embodiments herein can take the form of an entirely hardware embodiment which includes a dedicated digital logical circuit, an entirely software embodiment, or an embodiment including both hardware and software elements. The embodiments that are implemented in software include, but are not limited to, firmware, resident software, microcode, etc. Furthermore, the embodiments herein can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output (I/O) devices (including but not limited to keyboards, displays, pointing devices, remote controls, etc.) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

A representative hardware environment for practicing the embodiments herein is depicted in FIG. 5. This schematic drawing illustrates a hardware configuration of an information handling/computer system in accordance with the embodiments herein. The system comprises at least one processor or central processing unit (CPU) 10. The CPUs 10 are interconnected via system bus 12 to various devices such as a random access memory (RAM) 14, read-only memory (ROM) 16, and an input/output (I/O) adapter 18. The I/O adapter 18 can connect to peripheral devices, such as disk units 11 and tape drives 13, or other program storage devices that are readable by the system. The system can read the inventive instructions on the program storage devices and follow these instructions to execute the methodology of the embodiments herein.

The system further includes a user interface adapter 19 that connects a keyboard 15, mouse 17, speaker 24, microphone 22, and/or other user interface devices such as a touch screen device (not shown) or a remote control to the bus 12 to gather user input. Additionally, a communication adapter 20 connects the bus 12 to a data processing network 25, and a display adapter 21 connects the bus 12 to a display device 23 which may be embodied as an output device such as a monitor, printer, or transmitter, for example.

FIG. 6 is a flow diagram 600 illustrating a method of wait-free algorithm operations with the hardware accelerator 108 according to an embodiment herein. In step 602, a plurality of software threads 102A-N is executed by a plurality of processing units 104A-N, and the plurality of software threads 102A-N is associated with at least one operation. In step 604, at least one of a read request or a write request is generated at the hardware accelerator 108 based on the execution. In step 606, at least one operation including PARAM and the read request or the write request is generated at the hardware accelerator 108. In step 608, an operation specific condition of at least one software thread of the plurality of software threads 102A-N is checked. In step 610, at least one of a read VALUE or a write VALUE and at least one STATE variable is updated upon the operation specific condition being an operation success.

The plurality of processing units 104A-N is communicatively associated with the hardware accelerator 108. The at least one operation is one of a read request or a write request. The hardware accelerator 108 is associated with a plurality of buses 106A-N. The hardware accelerator 108 is accessible to the plurality of software threads 102A-N associated with the plurality of processing units 104A-N as a memory mapped device, mapped into a pre-determined physical address range of each of the plurality of buses, for ensuring contention resolution among the plurality of buses 106A-N. The operation specific condition includes an operation success or an operation failure based on at least one of the PARAM, the read request, or the write request.

The method may further include performing, prior to checking the operation specific condition, (i) encoding the one or more operations and the device address associated with the read request to obtain encoded data, and (ii) returning at least one of a failure value or a success value of the one or more operations from the hardware accelerator 108 to the plurality of software threads 102A-N on a plurality of data lines associated with the pre-determined physical address range. The encoded data may be communicated to the hardware accelerator 108 by the plurality of software threads 102A-N executed by the plurality of processing units 104A-N. In one embodiment, the lock-free algorithm is partitioned into the software and the hardware. The encoded data is passed from the software to the hardware, and return encoded data is obtained from the hardware.

The method further includes performing, prior to checking the operation specific condition, encoding the one or more operations, the PARAM, the device address, and the plurality of data lines associated with the write request to obtain encoded data. The encoded data is communicated to the hardware accelerator 108 by the plurality of software threads 102A-N executed by the plurality of processing units 104A-N. The lock-free algorithm is partitioned into the software and the hardware. The encoded data is passed from the software to the hardware. In one embodiment, a contention within each of the plurality of buses 106A-N is resolved through one of an arbitration protocol and a starvation-free priority resolution technique.

The one or more operations and the PARAM may be encoded as the least significant bits of the encoded data. In one embodiment, the steps of checking the operation specific condition and updating are performed by the hardware accelerator 108. In one embodiment, the steps of encoding and returning are performed by the hardware accelerator 108. In one embodiment, the pre-determined physical address range associated with each of the plurality of buses 106A-N is associated with at least one processing unit of the plurality of processing units 104A-N. The method may further include encoding the one or more operations, the device address, and a memory address location of the PARAM for generating the encoded data, upon the size of the PARAM exceeding a pre-allocated number of bits for the PARAM in the encoded data.

The memory address location corresponds to a pre-allocated memory for the PARAM. In one embodiment, the pre-allocated memory is allocated proportional to a number of concurrent requests during execution of the plurality of software threads by the hardware accelerator 108 at any predetermined instance of time. The method may further include at least one of (a) masking at least one interrupt on a processing unit, from among the plurality of processing units 104A-N, being accessed by the hardware accelerator 108, (b) writing into the pre-allocated memory for the PARAM reserved for the processing unit, (c) performing a read or write operation to the hardware accelerator 108 and passing the pre-allocated memory as PARAM for the encoding, and (d) unmasking the masked interrupt.

The method may further include allocating the pre-allocated memory for the PARAM based on a circular queue, which includes at least one of (i) reading a dedicated hardware accelerator to obtain a pre-allocated memory for the PARAM, (ii) writing into the pre-allocated memory for the PARAM reserved for the processing unit, (iii) performing a read or write operation to the dedicated hardware accelerator and passing the pre-allocated memory as PARAM, and (iv) writing the pre-allocated memory into the dedicated hardware accelerator to release the pre-allocated memory. The dedicated hardware accelerator may be dedicated for PARAM memory allocation.

There are no failures due to atomicity violation because the hardware accelerator 108 is built to process requests one by one. There is a fixed upper bound on the time limit, which may be independent of the number of software threads in one example embodiment. The time limit is based on the contention resolution method; for example, with a round-robin scheme the time limit may be “Operation time” X “number of CPUs”. The memory used may be independent of the number of software threads: it is a constant without the extension for larger PARAMs, and is proportional to the number of CPUs with the extension. This method converts a lock-free algorithm to a wait-free one at a cost which grows at a rate less than the number of software threads, without any degradation in performance. This enables a specific partitioning and interfacing between software and hardware which are designed to eliminate atomicity violations.

The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the appended claims.

What is claimed is:
1. A method comprising: executing a plurality of software threads by a plurality of processing units, said plurality of software threads being associated with at least one operation, wherein said plurality of processing units is communicatively associated with a hardware accelerator, and wherein said at least one operation is one of a read request or a write request; generating at least one of said read request or said write request at said hardware accelerator based on said execution; generating at least one operation comprising PARAM and read request or write request at said hardware accelerator, wherein said hardware accelerator is associated with a plurality of buses, and wherein said hardware accelerator is accessible to said plurality of software threads associated with said plurality of processing units as a memory mapped device mapped into a pre-determined physical address range of each of said plurality of buses for ensuring contention resolution among said plurality of buses; checking an operation specific condition of at least one software thread of said plurality of software threads, wherein said operation specific condition comprises an operation success or an operation failure based on at least one of said PARAM, said read request, or said write request; and updating at least one of read value or write value and at least one state variable upon said operation specific condition being an operation success.
2. The method as claimed in claim 1, further comprising, prior to checking said operation specific condition: encoding said at least one operation and a device address associated with said read request to obtain encoded data, wherein said encoded data is communicated to said hardware accelerator by said plurality of software threads executed by said plurality of processing units; and returning at least one of a failure value or a success value of said at least one operation from said hardware accelerator to said plurality of software threads on a plurality of data lines associated with said pre-determined physical address range, wherein a lock-free algorithm is partitioned into software and hardware, and wherein said encoded data is passed from said software to said hardware and return encoded data is obtained from said hardware.
3. The method of claim 1, further comprising, prior to checking said operation specific condition: encoding said at least one operation, said PARAM, a device address, and a plurality of data lines associated with said write request to obtain encoded data, wherein said encoded data is communicated to said hardware accelerator by said plurality of software threads executed by said plurality of processing units, wherein a lock-free algorithm is partitioned into software and hardware, and wherein said encoded data is passed from said software to said hardware.
4. The method of claim 1, wherein a contention within each of said plurality of buses is resolved through one of an arbitration protocol and a starvation-free priority resolution technique.
5. The method of claim 3, wherein at least one of said operations and said PARAM is encoded as least significant bits of said encoded data.
6. The method of claim 1, wherein said steps of checking the operation specific condition and updating are performed by said hardware accelerator.
7. The method of claim 2, wherein said steps of encoding and returning are performed by said hardware accelerator.
8. The method of claim 1, wherein said pre-determined physical address range associated with each of said plurality of buses is associated with at least one processing unit of said plurality of processing units.
9. The method of claim 3, further comprising encoding said at least one operation, said device address, and a memory address location of said PARAM for generating said encoded data, upon a size of said PARAM exceeding a pre-allocated number of bits for said PARAM in said encoded data, wherein said memory address location corresponds to a pre-allocated memory for said PARAM.
10. The method of claim 9, wherein said pre-allocated memory is allocated proportional to a number of concurrent requests during execution of said plurality of software threads by said hardware accelerator at any predetermined instance of time.
11. The method of claim 9, further comprising: (a) masking at least one interrupt on a processing unit, from among said plurality of processing units, being accessed by said hardware accelerator; (b) writing into said pre-allocated memory for said PARAM reserved for said processing unit; (c) performing a read or write operation to said hardware accelerator and passing said pre-allocated memory as PARAM for said encoding; and (d) unmasking said masked interrupt.
12. The method of claim 9, further comprising allocating said pre-allocated memory for said PARAM based on a circular queue, comprising: reading a dedicated hardware accelerator to obtain a pre-allocated memory for said PARAM, wherein said dedicated hardware accelerator is dedicated for PARAM memory allocation; writing into said pre-allocated memory for said PARAM reserved for said processing unit; performing a read or write operation to said dedicated hardware accelerator and passing said pre-allocated memory as PARAM; and writing said pre-allocated memory into said dedicated hardware accelerator to release said pre-allocated memory.
13. A hardware accelerator comprising a dedicated digital logical circuit and memory storing at least one VALUE and at least one STATE, wherein said dedicated digital logical circuit is configured to: process at least one of a read request or a write request at said hardware accelerator upon execution of a plurality of software threads by a plurality of processing units, said plurality of software threads being associated with at least one operation; process at least one operation comprising PARAM and read request or write request at said hardware accelerator, wherein said hardware accelerator is associated with a plurality of buses, and wherein said hardware accelerator is accessible to said plurality of software threads associated with said plurality of processing units as a memory mapped device mapped into a pre-determined physical address range of each of said plurality of buses for ensuring contention resolution among said plurality of buses; check an operation specific condition of at least one software thread of said plurality of software threads, wherein said operation specific condition comprises an operation success or an operation failure based on at least one of said PARAM, said read request, or said write request; and update at least one of: at least one read VALUE or write VALUE and at least one STATE variable upon said operation specific condition being an operation success.
14. The hardware accelerator of claim 13, further configured to perform, prior to checking said operation specific condition: decoding said at least one operation and a device address associated with said read request from encoded data, wherein said encoded data is communicated to said hardware accelerator by said plurality of software threads executed by said plurality of processing units; and returning at least one of a failure value or a success value of said at least one operation from said hardware accelerator to said plurality of software threads on a plurality of data lines associated with said pre-determined physical address range, wherein a lock-free algorithm is partitioned into software and hardware, and wherein said encoded data is passed from said software to said hardware and return encoded data is obtained from said hardware.
15. The hardware accelerator of claim 13, further configured to perform, prior to checking said operation specific condition: decoding said at least one operation, said PARAM, a device address, and a plurality of data lines associated with said write request from encoded data, wherein said encoded data is communicated to said hardware accelerator by said plurality of software threads executed by said plurality of processing units, wherein a lock-free algorithm is partitioned into software and hardware, and wherein said encoded data is passed from said software to said hardware.
16. The hardware accelerator of claim 13, wherein a contention within each of said plurality of buses is resolved through one of an arbitration protocol and a starvation-free priority resolution technique.
17. The hardware accelerator of claim 15, wherein at least one of said operations and said PARAM is encoded as least significant bits of said encoded data.
18. The hardware accelerator of claim 13, wherein said pre-determined physical address range associated with each of said plurality of buses is associated with at least one processing unit of said plurality of processing units.
19. The hardware accelerator of claim 15, further configured to decode said at least one operation, said device address, and a memory address location of said PARAM for generating said encoded data, upon a size of said PARAM exceeding a pre-allocated number of bits for said PARAM in said encoded data, wherein said memory address location corresponds to a pre-allocated memory for said PARAM.
20. The hardware accelerator of claim 19, further configured to, upon receiving a read or write operation that passes said pre-allocated memory as PARAM for said encoding, perform a read operation to retrieve said pre-allocated memory and use its contents as PARAM for the requested operation.
21. The hardware accelerator of claim 19, further configured to allocate said pre-allocated memory for said PARAM based on a circular queue, comprising: reading a dedicated hardware accelerator to allocate a pre-allocated memory for said PARAM, wherein said dedicated hardware accelerator is dedicated for PARAM memory allocation; and writing said pre-allocated memory into said dedicated hardware accelerator to release said pre-allocated memory.
22. A hardware accelerator comprising a processor and a memory storing instructions executed by said processor, wherein said memory stores at least one VALUE and at least one STATE, and wherein said processor is configured to: process at least one of a read request or a write request at said hardware accelerator upon execution of a plurality of software threads by a plurality of processing units, said plurality of software threads being associated with at least one operation; process at least one operation comprising PARAM and read request or write request at said hardware accelerator, wherein said hardware accelerator is associated with a plurality of buses, and wherein said hardware accelerator is accessible to said plurality of software threads associated with said plurality of processing units as a memory mapped device mapped into a pre-determined physical address range of each of said plurality of buses for ensuring contention resolution among said plurality of buses; check an operation specific condition of at least one software thread of said plurality of software threads, wherein said operation specific condition comprises an operation success or an operation failure based on at least one of said PARAM, said read request, or said write request; and update at least one of read value, write value, or at least one state variable upon said operation specific condition being an operation success.